Structured Reasoning with Tree-of-Thoughts for Bengali Math Word Problems

Mahmood, Aurprita, Alam, Sabrin, Sagor, Neloy Kumer, Hadi, Md. Abdul, Islam, Md. Sehab Al, Islam, Minhajul

arXiv.org Artificial Intelligence

Mathematical Word Problems (MWPs) are among the most challenging tasks in natural language processing because they require both linguistic understanding and multi-step numerical reasoning. While Chain-of-Thought (CoT) prompting has shown promise, its linear structure often propagates errors, limiting overall effectiveness. To address this limitation, we present a systematic study of Tree-of-Thought (ToT) reasoning for Bengali MWPs using the SOMADHAN dataset. Owing to computational and token-cost constraints, we evaluate a curated set of 100 representative problems across multiple large language models (LLMs), including GPT-OSS and LLaMA variants, under standard prompting, CoT, and ToT strategies. Our results show that CoT improves baseline accuracy from 78% (standard prompting) to 83% on average, while ToT further increases performance by up to 5 percentage points, achieving 88% accuracy with GPT-OSS-120B. These improvements indicate that ToT is particularly effective for medium-to-large-scale models but may offer less advantage for smaller ones. Overall, our findings establish ToT as a robust framework for solving mathematical problems in low-resource languages such as Bengali. More broadly, this study shows that structured reasoning methods like ToT can provide more reliable and globally consistent outcomes than CoT, paving the way for better reasoning strategies in multilingual NLP.
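
The contrast between a linear CoT chain and a branching ToT search can be made concrete with a small sketch. The Python below is an illustrative beam search over model-proposed reasoning steps; `call_llm`, the branching factor, depth, beam width, and the scoring prompt are placeholders, not the configuration used in the paper.

```python
# Illustrative Tree-of-Thought-style beam search over model-proposed reasoning
# steps. `call_llm`, the branching factor, depth, beam width, and the scoring
# prompt are all placeholders, not the paper's exact setup.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your chat-completion client here")

def propose_thoughts(problem: str, partial: str, k: int = 3) -> list[str]:
    """Ask the model for k alternative next reasoning steps."""
    prompt = (
        f"Problem: {problem}\n"
        f"Reasoning so far: {partial or '(none)'}\n"
        f"Propose {k} distinct next steps, one per line."
    )
    return call_llm(prompt).strip().splitlines()[:k]

def score_thought(problem: str, partial: str, thought: str) -> float:
    """Have the model rate how promising a candidate step is (0-10)."""
    prompt = (
        f"Problem: {problem}\nReasoning so far: {partial}\n"
        f"Candidate step: {thought}\n"
        "Rate from 0 to 10 how likely this step leads to a correct answer. "
        "Reply with a single number."
    )
    try:
        return float(call_llm(prompt).strip())
    except ValueError:
        return 0.0

def tree_of_thought(problem: str, depth: int = 3, beam: int = 2) -> str:
    """Keep the `beam` best partial reasoning paths at every depth,
    instead of committing to one linear CoT chain."""
    frontier = [""]
    for _ in range(depth):
        scored = []
        for partial in frontier:
            for thought in propose_thoughts(problem, partial):
                path = (partial + "\n" + thought).strip()
                scored.append((score_thought(problem, partial, thought), path))
        frontier = [path for _, path in sorted(scored, reverse=True)[:beam]]
    return call_llm(
        f"Problem: {problem}\nReasoning:\n{frontier[0]}\nGive the final numeric answer."
    )
```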


ProRefine: Inference-Time Prompt Refinement with Textual Feedback

Pandita, Deepak, Weerasooriya, Tharindu Cyril, Shah, Ankit Parag, Ng, Isabelle Diana May-Xin, Homan, Christopher M., Wei, Wei

arXiv.org Artificial Intelligence

Agentic workflows, where multiple AI agents collaborate to accomplish complex tasks like reasoning or planning, play a substantial role in many cutting-edge commercial applications, and continue to fascinate researchers across fields for their potential to accomplish expensive, complex tasks that, until recently, only humans have been trusted to do. These workflows depend critically on the prompts that define the roles models play within them. Poorly designed prompts that fail even slightly to guide individual agents can lead to sub-optimal performance that snowballs through a system of agents, limiting reliability and scalability. To address this important problem of inference-time prompt optimization, we introduce ProRefine, an inference-time optimization method that uses an agentic loop of LLMs to generate and apply textual feedback. ProRefine dynamically refines prompts for multi-step reasoning tasks without additional training or ground-truth labels. Evaluated on five benchmark mathematical reasoning datasets, ProRefine significantly surpasses zero-shot Chain-of-Thought baselines by 3 to 37 percentage points. This approach not only boosts accuracy but also allows smaller models to approach the performance of their larger counterparts. This highlights its potential for building more cost-effective and powerful hybrid AI systems, thereby democratizing access to high-performing AI.
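
As an illustration of how such an inference-time feedback loop can look, here is a minimal Python sketch: a critic model writes textual feedback on the current answer and a refiner rewrites the prompt before the next attempt. The `call_llm` helper, the prompts, and the stopping rule are assumptions for illustration, not ProRefine's exact procedure.

```python
# Minimal sketch of an inference-time prompt-refinement loop: a critic model
# writes textual feedback on the current answer and a refiner rewrites the
# prompt before the next attempt. Prompts and stopping rule are illustrative.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your chat-completion client here")

def refine_prompt(task: str, base_prompt: str, max_rounds: int = 3) -> str:
    prompt = base_prompt
    for _ in range(max_rounds):
        answer = call_llm(f"{prompt}\n\nTask: {task}")
        feedback = call_llm(
            "You are a critic. Point out concrete weaknesses in the reasoning "
            f"below, or reply OK if it looks sound.\n\nTask: {task}\nAnswer: {answer}"
        )
        if feedback.strip().upper().startswith("OK"):
            break  # no actionable feedback: keep the current prompt
        prompt = call_llm(
            "Rewrite the following instruction prompt so that the listed "
            "weaknesses are addressed. Keep it concise.\n\n"
            f"Prompt: {prompt}\nFeedback: {feedback}"
        )
    return prompt
```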


Foundation of Intelligence: Review of Math Word Problems from Human Cognition Perspective

Huang, Zhenya, Liu, Jiayu, Lin, Xin, Ma, Zhiyuan, Xue, Shangzi, Xiao, Tong, Liu, Qi, Teh, Yee Whye, Chen, Enhong

arXiv.org Artificial Intelligence

Math word problems (MWPs) have served as a fundamental research topic in artificial intelligence (AI) since the 1960s. This research aims to advance the reasoning abilities of AI by mirroring human-like cognitive intelligence. The mainstream technological paradigm has evolved from early rule-based methods, to deep learning models, and is now rapidly advancing towards large language models. However, the field still lacks a systematic taxonomy of MWP research along with a discussion of current development trends. Therefore, in this paper, we comprehensively review related research on MWP solving through the lens of human cognition, to demonstrate how recent AI models are advancing in simulating human cognitive abilities. Specifically, we summarize 5 crucial cognitive abilities for MWP solving: Problem Understanding, Logical Organization, Associative Memory, Critical Thinking, and Knowledge Learning. Focusing on these abilities, we review the two mainstream families of MWP solvers from the past 10 years, neural network solvers and LLM-based solvers, and discuss the core human-like abilities they demonstrate in their intricate problem-solving processes. Moreover, we rerun all the representative MWP solvers and supplement their performance on 5 mainstream benchmarks for a unified comparison. To the best of our knowledge, this survey is the first to comprehensively analyze the influential MWP research of the past decade from the perspective of human reasoning cognition and to provide an integrative comparison across existing approaches. We hope it can inspire further research in AI reasoning. Our repository is released at https://github.com/Ljyustc/FoI-MWP.


EDUMATH: Generating Standards-aligned Educational Math Word Problems

Christ, Bryan R., Molitz, Penelope, Kropko, Jonathan, Hartvigsen, Thomas

arXiv.org Artificial Intelligence

Math word problems (MWPs) are critical K-12 educational tools, and customizing them to students' interests and ability levels can improve learning outcomes. However, teachers struggle to find time to customize MWPs for each student given large class sizes and increasing burnout. We propose that LLMs can support math education by generating MWPs customized to student interests and math education standards. To this end, we use a joint human expert-LLM judge approach to evaluate over 11,000 MWPs generated by open and closed LLMs and develop the first teacher-annotated dataset for standards-aligned educational MWP generation. We show the value of our data by using it to train a 12B open model that matches the performance of larger and more capable open models. We also use our teacher-annotated data to train a text classifier that enables a 30B open LLM to outperform existing closed baselines without any training. Next, we show that our models' MWPs are more similar to human-written MWPs than those from existing models. We conclude with the first study of customized LLM-generated MWPs with grade school students, finding that they perform similarly on our models' MWPs and on human-written MWPs, but consistently prefer our customized MWPs.


Bridging the Culture Gap: A Framework for LLM-Driven Socio-Cultural Localization of Math Word Problems in Low-Resource Languages

Azime, Israel Abebe, Belay, Tadesse Destaw, Klakow, Dietrich, Slusallek, Philipp, Chhabra, Anshuman

arXiv.org Artificial Intelligence

Large language models (LLMs) have demonstrated significant capabilities in solving mathematical problems expressed in natural language. However, multilingual and culturally grounded mathematical reasoning in low-resource languages lags behind English due to the scarcity of socio-cultural task datasets that reflect accurate native entities such as person names, organization names, and currencies. Existing multilingual benchmarks are predominantly produced via translation and typically retain English-centric entities, owing to the high cost of human-annotator-based localization. Moreover, automated localization tools are limited, and hence truly localized datasets remain scarce. To bridge this gap, we introduce a framework for LLM-driven cultural localization of math word problems that automatically constructs datasets with native names, organizations, and currencies from existing sources. We find that translated benchmarks can obscure true multilingual math ability under appropriate socio-cultural contexts. Through extensive experiments, we also show that our framework can help mitigate English-centric entity bias and improve robustness when native entities are introduced across various languages.
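
A toy sketch of the localization idea follows: English-centric entities in a word problem are swapped for native names and currencies. The entity table and the regex substitution are invented stand-ins; the framework described in the paper performs detection and substitution with an LLM rather than a fixed dictionary.

```python
# Toy illustration of socio-cultural localization: swap English-centric
# entities in a translated word problem for native ones. The entity table and
# the regex substitution are invented stand-ins for an LLM-driven pipeline.

import re

NATIVE_ENTITIES = {      # hypothetical Amharic-flavoured substitutions
    "John": "Abebe",
    "Sarah": "Tigist",
    "dollars": "birr",
    r"\$": "Br ",
}

def localize(problem: str, table: dict[str, str] = NATIVE_ENTITIES) -> str:
    for pattern, replacement in table.items():
        problem = re.sub(pattern, replacement, problem)
    return problem

print(localize("John gave Sarah $5 and kept 3 dollars."))
# -> Abebe gave Tigist Br 5 and kept 3 birr.
```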


SKYLENAGE Technical Report: Mathematical Reasoning and Contest-Innovation Benchmarks for Multi-Level Math Evaluation

Wei, Hu, Xu, Ze, Yang, Boyu, Miao, Linlin, Zhai, Weiqi, Li, Yihan, Li, Zixuan, Wang, Zhijun, Wang, Boya, Yu, Jianwei, Yuan, Jialing, Zhang, Xiaoyue, He, Cheng, Chen, Minglei, Zhang, Zifan, Li, Qianhui, Wang, Wei, Xu, Xiang

arXiv.org Artificial Intelligence

Large language models (LLMs) now perform strongly on many public math suites, yet frontier separation within mathematics increasingly suffers from ceiling effects. We present two complementary benchmarks: SKYLENAGE-ReasoningMATH, a 100-item, structure-aware diagnostic set with per-item metadata on length, numeric density, and symbolic complexity; and SKYLENAGE-MATH, a 150-item contest-style suite spanning four stages from high school to doctoral under a seven-subject taxonomy. We evaluate fifteen contemporary LLM variants under a single setup and analyze subject × model and grade × model performance. On the contest suite, the strongest model reaches 44% while the runner-up reaches 37%; accuracy declines from high school to doctoral, and top systems exhibit doctoral-to-high-school retention near 79%. On the reasoning set, the best model attains 81% overall, and results on the hardest slice reveal clear robustness gaps between the leaders and the mid-tier. In summary, we release SKYLENAGE-ReasoningMATH and report aggregate results for SKYLENAGE-MATH; together, SKYLENAGE provides a hard, reasoning-centered, broad-coverage math benchmark with calibrated difficulty and rich metadata, serving as a reference for future evaluations of mathematical reasoning.


A Unifying Framework for Parallelizing Sequential Models with Linear Dynamical Systems

Gonzalez, Xavier, Buchanan, E. Kelly, Lee, Hyun Dong, Liu, Jerry Weihong, Wang, Ke Alexander, Zoltowski, David M., Ré, Christopher, Linderman, Scott W.

arXiv.org Artificial Intelligence

Harnessing parallelism in seemingly sequential models is a central challenge for modern machine learning. Several approaches have been proposed for evaluating sequential processes in parallel using fixed-point methods, like Newton, Picard, and Jacobi iterations. In this work, we show that these methods can be understood within a common framework based on linear dynamical systems (LDSs), where different iteration schemes arise naturally as approximate linearizations of a nonlinear recursion. This unifying view highlights shared principles behind these techniques and clarifies when particular fixed-point methods are most likely to be effective. By bridging diverse algorithms through the language of LDSs, our framework provides a clearer theoretical foundation for parallelizing sequential models and points toward new opportunities for efficient and scalable computation.
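
A small numerical sketch conveys the core fixed-point idea for evaluating a sequential recursion x_t = f(x_{t-1}, u_t) in parallel: guess the whole trajectory, then update every time step at once until it stops changing. The recurrence below is an arbitrary toy, and the plain Picard/Jacobi-style sweep is only one of the schemes covered by the framework; Newton-style updates would additionally linearize f around the current trajectory.

```python
# Toy numerical sketch of parallelizing a sequential recursion
# x_t = f(x_{t-1}, u_t) with fixed-point sweeps. The recurrence f is an
# arbitrary stand-in, not a model from the paper.

import numpy as np

def f(x_prev, u):
    return np.tanh(0.9 * x_prev + u)   # toy nonlinear recurrence

def sequential(u, x0=0.0):
    xs = [x0]
    for ut in u:
        xs.append(f(xs[-1], ut))
    return np.array(xs[1:])

def parallel_fixed_point(u, x0=0.0, max_sweeps=50, tol=1e-10):
    x = np.zeros_like(u)                        # guess for all states at once
    for _ in range(max_sweeps):
        prev = np.concatenate(([x0], x[:-1]))   # x_{t-1} for every t
        x_new = f(prev, u)                      # all time steps in parallel
        if np.max(np.abs(x_new - x)) < tol:
            break
        x = x_new
    return x

u = np.random.default_rng(0).normal(size=20)
print(np.allclose(sequential(u), parallel_fixed_point(u)))  # True
```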


Solving Math Word Problems Using Estimation Verification and Equation Generation

Piehl, Mitchell, Wilson, Dillon, Kalita, Ananya, Kalita, Jugal

arXiv.org Artificial Intelligence

Large Language Models (LLMs) excel at various tasks, including problem-solving and question-answering. However, LLMs often find Math Word Problems (MWPs) challenging because solving them requires a range of reasoning and mathematical abilities with which LLMs seem to struggle. Recent efforts have helped LLMs solve more complex MWPs with improved prompts. This study proposes a novel method that first prompts an LLM to create equations from a decomposition of the question, then uses an external symbolic equation solver to produce an answer. To check the accuracy of the obtained answer, and inspired by an established recommendation of math teachers, the LLM is instructed to solve the MWP a second time, this time with the objective of estimating the correct answer rather than solving it exactly. The estimate is then compared to the generated answer for verification. If verification fails, an iterative rectification process is employed to ensure the correct answer is eventually found. This approach achieves new state-of-the-art results on datasets used by prior published research on numeric and algebraic MWPs, improving the previous best results by nearly two percent on average. In addition, the approach obtains satisfactory results on trigonometric MWPs, a task not previously attempted to the best of the authors' knowledge. This study also introduces two new datasets, SVAMPClean and Trig300, to further advance the testing of LLMs' reasoning abilities.
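
The estimation-verification step can be sketched in a few lines of Python with SymPy as the external symbolic solver. In the sketch below, the equation string and the rough estimate are hard-coded stand-ins for values the paper obtains from LLM prompts, and the tolerance is an arbitrary illustrative choice.

```python
# Sketch of the estimation-verification idea with SymPy as the external
# symbolic solver. Equation string, estimate, and tolerance are stand-ins.

import sympy as sp

def solve_equation(equation_str: str, symbol: str = "x") -> float:
    """Solve a single-variable equation of the form 'lhs = rhs' exactly."""
    var = sp.Symbol(symbol)
    lhs, rhs = equation_str.split("=")
    solutions = sp.solve(sp.Eq(sp.sympify(lhs), sp.sympify(rhs)), var)
    return float(solutions[0])

def verify(exact: float, estimate: float, rel_tol: float = 0.25) -> bool:
    """Accept the exact answer only if it is close to the rough estimate."""
    return abs(exact - estimate) <= rel_tol * max(abs(estimate), 1.0)

# e.g. "Three tickets plus a $4 fee cost $25; how much is one ticket?"
exact = solve_equation("3*x + 4 = 25")   # 7.0
estimate = 7.5                           # stand-in for the LLM's estimate
print(exact, verify(exact, estimate))    # 7.0 True
```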


A New NMT Model for Translating Clinical Texts from English to Spanish

Li, Rumeng, Wang, Xun, Yu, Hong

arXiv.org Artificial Intelligence

Translating electronic health record (EHR) narratives from English to Spanish is a clinically important yet challenging task due to the lack of a parallel-aligned corpus and the abundance of unknown words such narratives contain. To address these challenges, we propose NOOV (for "No OOV"), a new neural machine translation (NMT) system that requires little in-domain parallel-aligned corpus for training. NOOV integrates a bilingual lexicon automatically learned from parallel-aligned corpora and a phrase look-up table extracted from a large biomedical knowledge resource to alleviate both the unknown-word problem and the word-repeat challenge in NMT, enhancing the phrase generation of NMT systems. Evaluation shows that NOOV generates better translations of EHR narratives, with improvements in both accuracy and fluency.
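
As a rough analogue of the bilingual-lexicon idea, the snippet below patches unknown source tokens with lexicon entries before decoding. This is only a preprocessing caricature; NOOV integrates the lexicon and phrase table inside the NMT system itself, and the vocabulary and lexicon entries here are invented examples.

```python
# Rough preprocessing analogue of the bilingual-lexicon idea: source tokens
# outside the model's vocabulary are looked up in a lexicon before decoding.
# Vocabulary and lexicon entries are invented examples.

LEXICON = {"hyperlipidemia": "hiperlipidemia", "dyspnea": "disnea"}
KNOWN_VOCAB = {"the", "patient", "reports", "and"}

def patch_oov(tokens: list[str]) -> list[str]:
    """Replace out-of-vocabulary tokens with lexicon translations if available."""
    return [t if t in KNOWN_VOCAB else LEXICON.get(t, t) for t in tokens]

print(patch_oov("the patient reports dyspnea and hyperlipidemia".split()))
# ['the', 'patient', 'reports', 'disnea', 'and', 'hiperlipidemia']
```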


Large Language Models Don't Make Sense of Word Problems. A Scoping Review from a Mathematics Education Perspective

Strohmaier, Anselm R., Van Dooren, Wim, Seßler, Kathrin, Greer, Brian, Verschaffel, Lieven

arXiv.org Artificial Intelligence

Preprint, August 2025. This version has not been peer-reviewed.

Abstract: The progress of Large Language Models (LLMs) like ChatGPT raises the question of how they can be integrated into education. One hope is that they can support mathematics learning, including word-problem solving. Since LLMs can handle textual input with ease, they appear well suited for solving mathematical word problems. Yet their real competence, whether they can make sense of the real-world context, and the implications for classrooms remain unclear. We conducted a scoping review from a mathematics-education perspective, comprising three parts: a technical overview, a systematic review of word problems used in research, and a state-of-the-art empirical evaluation of LLMs on mathematical word problems. First, in the technical overview, we contrast the conceptualization of word problems and their solution processes between LLMs and students. In computer-science research this is typically labeled mathematical reasoning, a term that does not align with its usage in mathematics education. Second, our literature review of 213 studies shows that the most popular word-problem corpora are dominated by s-problems, which do not require consideration of the realities of their real-world context. Finally, our evaluation of GPT-3.5-turbo, GPT-4o-mini, GPT-4.1, o3, and GPT-5 on 287 word problems shows that most recent LLMs solve these s-problems with near-perfect accuracy, including a perfect score on 20 problems from PISA. LLMs still showed weaknesses in tackling problems where the real-world context is problematic or nonsensical. In sum, we argue on the basis of all three aspects that LLMs have mastered a superficial solution process but do not make sense of word problems, which potentially limits their value as instructional tools in mathematics classrooms.

Keywords: LLM; word-problem solving; AI; mathematical reasoning; modelling

1 Introduction

In the last couple of years, the rapid improvement of Large Language Models (LLMs) has led to an unprecedented interest in educational research in artificial intelligence in general, and in LLMs in particular (Kasneci et al., 2023). However, while LLMs excel at producing, translating, and reviewing text, they are not natively designed for processing numerical information, calculating, or proving (Chang et al., 2024). Compared to other tasks, solving mathematical problems is relatively difficult for LLMs (Testolin, 2024). This is also true for mathematical word-problem solving.